Abstract



We show how binary classification methods developed to work on i.i.d. data can 
be used for solving statistical problems that are seemingly unrelated to classifi- 
cation and concern highly-dependent time series. Specifically, the problems of 
time-series clustering, homogeneity testing and the three-sample problem are ad- 
dressed. The algorithms that we construct for solving these problems are based 
on a new metric between time-series distributions, which can be evaluated using 
binary classification methods. Universal consistency of the proposed algorithms 
is proven under most general assumptions. The theoretical results are illustrated 
with experiments on synthetic and real-world data. 

1 Introduction 

Binary classification is one of the most well-understood problems of machine learning and statistics: 
a wealth of efficient classification algorithms has been developed and applied to a wide range of 
applications. Perhaps one of the reasons for this is that binary classification is conceptually one of 
the simplest statistical learning problems. It is thus natural to try and use it as a building block for 
solving other, more complex, newer or just different problems; in other words, one can try to obtain 
efficient algorithms for different learning problems by reducing them to binary classification. This 
approach has been applied to many different problems, starting with multi-class classification, and 
including regression and ranking [3, 16], to give just a few examples. However, all of these problems
are formulated in terms of independent and identically distributed (i.i.d.) samples. This is also the 
assumption underlying the theoretical analysis of most of the classification algorithms. 

In this work we consider learning problems that concern time-series data for which independence 
assumptions do not hold. The series can exhibit arbitrary long-range dependence, and different time- 
series samples may be interdependent as well. Moreover, the learning problems that we consider — 
the three-sample problem, time-series clustering, and homogeneity testing — at first glance seem 
completely unrelated to classification. 

We show how the considered problems can be reduced to binary classification methods. The results 
include asymptotically consistent algorithms, as well as finite-sample analysis. To establish the con- 
sistency of the suggested methods, for clustering and the three-sample problem the only assumption 
that we make on the data is that the distributions generating the samples are stationary ergodic; this 
is one of the weakest assumptions used in statistics. For homogeneity testing we have to make some 
mixing assumptions in order to obtain consistency results (this is indeed unavoidable [22]). Mixing
conditions are also used to obtain finite-sample performance guarantees for the first two problems. 

The proposed approach is based on a new distance between time-series distributions (that is, be- 
tween probability distributions on the space of infinite sequences), which we call telescope distance. 
This distance can be evaluated using binary classification methods, and its finite-sample estimates 
are shown to be asymptotically consistent. Three main building blocks are used to construct the
telescope distance. The first one is a distance on finite-dimensional marginal distributions. The
distance we use for this is the following: $d_{\mathcal{H}}(P, Q) := \sup_{h \in \mathcal{H}} |E_P h - E_Q h|$,
where $P, Q$ are distributions and $\mathcal{H}$ is a set of functions. This distance can be estimated
using binary classification methods, and thus can be used to reduce various statistical problems to
the classification problem. This distance was previously applied to such statistical problems as
homogeneity testing and change-point estimation [14]. However, these applications so far have only
concerned i.i.d. data, whereas we want to work with highly-dependent time series. Thus, the second
building block is the recent results of [2], which show that empirical estimates of $d_{\mathcal{H}}$
are consistent (under certain conditions on $\mathcal{H}$) for arbitrary stationary ergodic
distributions. This, however, is not enough: evaluating $d_{\mathcal{H}}$ for (stationary ergodic)
time-series distributions means measuring the distance between their finite-dimensional marginals,
and not the distributions themselves. Finally, the third step to construct the distance is what we
call telescoping. It consists in summing the distances for all the (infinitely many)
finite-dimensional marginals with decreasing weights.

We show that the resulting distance (telescope distance) indeed can be consistently estimated based 
on sampling, for arbitrary stationary ergodic distributions. Further, we show how this fact can be 
used to construct consistent algorithms for the considered problems on time series. Thus we can 
harness binary classification methods to solve statistical learning problems concerning time series. 

To illustrate the theoretical results in an experimental setting, we chose the problem of time-series 
clustering, since it is a difficult unsupervised problem which seems most different from the prob- 
lem of binary classification. Experiments on both synthetic and real-world data are provided. The 
real-world setting concerns brain-computer interface (BCI) data, which is a notoriously challenging 
application, and on which the presented algorithm demonstrates competitive performance. 

A related approach to the problems considered here, as well as to some related problems about
stationary ergodic time series, is based on (consistent) empirical estimates of the distributional
distance; see [23, 21, 13], and [8] about the distributional distance. The empirical distance is
based on counting frequencies of bins of decreasing sizes and "telescoping." A similar telescoping
trick is used in different problems, e.g. sequence prediction [19]. Another related approach to
time-series analysis involves a different reduction, namely, that to data compression [20].

Organisation. Section 2 is preliminary. In Section 3 we introduce and discuss the telescope
distance. Section 4 explains how this distance can be calculated using binary classification methods.
Sections 5 and 6 are devoted to the three-sample problem and clustering, respectively. In Section 7,
under some mixing conditions, we address the problems of homogeneity testing, clustering with
unknown k, and finite-sample performance guarantees. Section 8 presents experimental evaluation.



2 Notation and definitions 



Let $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ be a measurable space (the domain). Time-series (or
process) distributions are probability measures on the space
$(\mathcal{X}^{\mathbb{N}}, \mathcal{F}_{\mathbb{N}})$ of one-way infinite sequences (where
$\mathcal{F}_{\mathbb{N}}$ is the Borel sigma-algebra of $\mathcal{X}^{\mathbb{N}}$). We use the
abbreviation $X_{1..k}$ for $X_1, \dots, X_k$. All sets and functions introduced below (in
particular, the sets $\mathcal{H}_k$ and their elements) are assumed measurable.

A distribution $\rho$ is stationary if $\rho(X_{1..k} \in A) = \rho(X_{n+1..n+k} \in A)$ for all
$A \in \mathcal{F}_{\mathcal{X}^k}$, $k, n \in \mathbb{N}$ (with $\mathcal{F}_{\mathcal{X}^k}$ being
the sigma-algebra of $\mathcal{X}^k$). A stationary distribution is called (stationary) ergodic if
$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1..n-k+1} I_{X_{i..i+k-1} \in A} = \rho(A)$ $\rho$-a.s. for
every $A \in \mathcal{F}_{\mathcal{X}^k}$, $k \in \mathbb{N}$. (This definition,
which is more suited for the purposes of this work, is equivalent to the usual one expressed in terms
of invariant sets, see e.g. 0.) 



3 A distance between time-series distributions 



We start with a distance between distributions on $\mathcal{X}$, and then we will extend it to
distributions on $\mathcal{X}^{\infty}$. For two probability distributions $P$ and $Q$ on
$(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ and a set $\mathcal{H}$ of measurable functions on
$\mathcal{X}$, one can define the distance

$$d_{\mathcal{H}}(P, Q) := \sup_{h \in \mathcal{H}} |E_P h - E_Q h|.$$

Special cases of this distance are the Kolmogorov-Smirnov [15], Kantorovich-Rubinstein [11] and
Fortet-Mourier [7] metrics; the general case has been studied since at least [...].

We will be interested in the cases where $d_{\mathcal{H}}(P, Q) = 0$ implies $P = Q$. Note that in
this case $d_{\mathcal{H}}$ is a metric (the rest of the properties are easy to see). For reasons
that will become apparent shortly (see Remark below), we will be mainly interested in the sets
$\mathcal{H}$ that consist of indicator functions. In this case we can identify each
$f \in \mathcal{H}$ with the set $\{x : f(x) = 1\} \subset \mathcal{X}$ and (by a slight abuse of
notation) write $d_{\mathcal{H}}(P, Q) := \sup_{h \in \mathcal{H}} |P(h) - Q(h)|$. It is easy to
check that in this case $d_{\mathcal{H}}$ is a metric if and only if $\mathcal{H}$ generates
$\mathcal{F}_{\mathcal{X}}$. The latter property is often easy to verify directly. First of all, it
trivially holds for the case where $\mathcal{H}$ is the set of halfspaces in a Euclidean
$\mathcal{X}$. It is also easy to check that it holds if $\mathcal{H}$ is the set of halfspaces in
the feature space of most commonly used kernels (provided the feature space is of the same or higher
dimension than the input space), such as polynomial and Gaussian kernels.
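
As a concrete one-dimensional illustration of the distance above, the following sketch takes the domain to be the real line and the set of functions to be the indicators of half-lines $(-\infty, t]$, in which case the empirical distance reduces to the two-sample Kolmogorov-Smirnov statistic mentioned earlier. The function name `d_h_thresholds` is an assumption made for this example and does not come from the paper.

```python
def d_h_thresholds(xs, ys):
    """Empirical d_H between two real-valued samples for
    H = {indicator of (-inf, t] : t real}.

    The supremum over t is attained at one of the sample points,
    so it is enough to scan the pooled sample.
    """
    best = 0.0
    for t in sorted(set(xs) | set(ys)):
        p = sum(x <= t for x in xs) / len(xs)  # empirical P(h)
        q = sum(y <= t for y in ys) / len(ys)  # empirical Q(h)
        best = max(best, abs(p - q))
    return best
```

Identical samples give distance 0, and two samples with disjoint, ordered supports give distance 1.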

Based on $d_{\mathcal{H}}$ we can construct a distance between time-series probability
distributions. For two time-series distributions $\rho_1, \rho_2$ we take the $d_{\mathcal{H}}$
between the $k$-dimensional marginal distributions of $\rho_1$ and $\rho_2$ for each
$k \in \mathbb{N}$, and sum them all up with decreasing weights.

Definition 1 (telescope distance $D$). For two time series distributions $\rho_1$ and $\rho_2$ on
the space $(\mathcal{X}^{\infty}, \mathcal{F}_{\infty})$ and a sequence of sets of functions
$\mathbf{H} = (\mathcal{H}_1, \mathcal{H}_2, \dots)$ define the telescope distance

$$D_{\mathbf{H}}(\rho_1, \rho_2) := \sum_{k=1}^{\infty} w_k \sup_{h \in \mathcal{H}_k} |E_{\rho_1} h(X_1, \dots, X_k) - E_{\rho_2} h(Y_1, \dots, Y_k)|, \quad (1)$$

where $w_k$, $k \in \mathbb{N}$ is a sequence of positive summable real weights (e.g. $w_k = 1/k^2$).

Lemma 1. $D_{\mathbf{H}}$ is a metric if and only if $d_{\mathcal{H}_k}$ is a metric for every
$k \in \mathbb{N}$.

Proof. The statement follows from the fact that two process distributions are the same if and only if
all their finite-dimensional marginals coincide. □

Definition 2 (empirical telescope distance $\hat{D}$). For a pair of samples $X_{1..n}$ and
$Y_{1..m}$ define the empirical telescope distance as

$$\hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) := \sum_{k=1}^{\min\{m,n\}} w_k \sup_{h \in \mathcal{H}_k} \left| \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - \frac{1}{m-k+1} \sum_{i=1}^{m-k+1} h(Y_{i..i+k-1}) \right|. \quad (2)$$



All the methods presented in this work are based on the empirical telescope distance. The key fact 
is that it is an asymptotically consistent estimate of the telescope distance, that is, the latter can be 
consistently estimated based on sampling. 
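
To make definition (2) concrete, here is a minimal sketch for real-valued sequences. It assumes a deliberately simple class of indicator functions, namely coordinate thresholds $h(x) = 1\{x_j \le t\}$ on each length-$k$ window, chosen only because the supremum can then be found by enumeration; it is an illustrative stand-in rather than the classifier-based computation used in the paper. The weights are $w_k = 1/k^2$, and the names `windows`, `sup_term`, `telescope_distance` and the optional cap `max_k` are assumptions of this example.

```python
def windows(seq, k):
    """All length-k windows X_{i..i+k-1} of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sup_term(wx, wy, k):
    """sup over h(x) = 1{x_j <= t} of |mean h(wx) - mean h(wy)|."""
    best = 0.0
    for j in range(k):
        a = [w[j] for w in wx]
        b = [w[j] for w in wy]
        for t in set(a) | set(b):
            p = sum(v <= t for v in a) / len(a)
            q = sum(v <= t for v in b) / len(b)
            best = max(best, abs(p - q))
    return best


def telescope_distance(x, y, max_k=None):
    """Empirical telescope distance (2) with weights w_k = 1/k^2.

    max_k optionally caps the number of summands; by default all
    min(n, m) summands of the definition are used.
    """
    l = min(len(x), len(y))
    if max_k is None:
        max_k = l
    total = 0.0
    for k in range(1, min(max_k, l) + 1):
        total += sup_term(windows(x, k), windows(y, k), k) / k ** 2
    return total
```

For example, `telescope_distance([0, 1, 0, 1], [0, 1, 0, 1])` is 0, while two constant sequences with different values are at distance $\sum_{k \le l} 1/k^2$.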

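Theorem 1. Let $\mathbf{H} = (\mathcal{H}_1, \mathcal{H}_2, \dots)$,
$\mathcal{H}_k \subset \mathcal{X}^k$, $k \in \mathbb{N}$ be a sequence of separable sets of
indicator functions of finite VC dimension such that $\mathcal{H}_k$ generates
$\mathcal{F}_{\mathcal{X}^k}$. Then, for every stationary ergodic time series distributions $\rho_X$
and $\rho_Y$ generating samples $X_{1..n}$ and $Y_{1..m}$ we have

$$\lim_{n,m \to \infty} \hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) = D_{\mathbf{H}}(\rho_X, \rho_Y). \quad (3)$$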



Note that $\hat{D}_{\mathbf{H}}$ is a biased estimate of $D_{\mathbf{H}}$, and, unlike in the i.i.d.
case, the bias may depend on the distributions; however, the bias is o(n).

Remark. The condition that the sets $\mathcal{H}_k$ are sets of indicator functions of finite VC
dimension comes from [2], where it is shown that for any stationary ergodic distribution $\rho$,
under these conditions,
$\sup_{h \in \mathcal{H}_k} \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1})$ is an asymptotically
consistent estimate of $\sup_{h \in \mathcal{H}_k} E_{\rho} h(X_1, \dots, X_k)$. This fact implies
that $d_{\mathcal{H}}$ can be consistently estimated, from which the theorem is derived.

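Proof of Theorem 1. As is established in [2], under the conditions of the theorem we have

$$\lim_{n \to \infty} \sup_{h \in \mathcal{H}_k} \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) = \sup_{h \in \mathcal{H}_k} E_{\rho_X} h(X_1, \dots, X_k) \quad \rho_X\text{-a.s.} \quad (4)$$

for all $k \in \mathbb{N}$, and likewise for $\rho_Y$. Fix an $\varepsilon > 0$. We can find a
$T \in \mathbb{N}$ such that

$$\sum_{k > T} w_k \le \varepsilon. \quad (5)$$

Note that $T$ depends only on $\varepsilon$. Moreover, as follows from (4), for each $k = 1..T$ we
can find an $N_k$ such that

$$\left| \sup_{h \in \mathcal{H}_k} \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - \sup_{h \in \mathcal{H}_k} E_{\rho_X} h(X_{1..k}) \right| \le \varepsilon / T \quad (6)$$

for all $n \ge N_k$. Let $N := \max_{i=1..T} N_i$ and define analogously $M$ for $\rho_Y$. Thus, for
$n \ge N$, $m \ge M$ we have

$$\hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) \le \sum_{k=1}^{T} w_k \sup_{h \in \mathcal{H}_k} \left| \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - \frac{1}{m-k+1} \sum_{i=1}^{m-k+1} h(Y_{i..i+k-1}) \right| + \varepsilon$$

$$\le \sum_{k=1}^{T} w_k \sup_{h \in \mathcal{H}_k} \left\{ \left| \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - E_{\rho_X} h(X_{1..k}) \right| + \left| E_{\rho_X} h(X_{1..k}) - E_{\rho_Y} h(Y_{1..k}) \right| + \left| E_{\rho_Y} h(Y_{1..k}) - \frac{1}{m-k+1} \sum_{i=1}^{m-k+1} h(Y_{i..i+k-1}) \right| \right\} + \varepsilon$$

$$\le 3\varepsilon + D_{\mathbf{H}}(\rho_X, \rho_Y),$$

where the first inequality follows from the definition of $\hat{D}_{\mathbf{H}}$ (2) and from (5),
and the last inequality follows from (6). Since $\varepsilon$ was chosen arbitrarily the statement
follows. □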



4 Calculating $\hat{D}_{\mathbf{H}}$ using binary classification methods

The methods for solving various statistical problems that we suggest are all based on
$\hat{D}_{\mathbf{H}}$. The main appeal of this approach is that $\hat{D}_{\mathbf{H}}$ can be
calculated using binary classification methods. Here we explain how to do it.

The definition (2) of $\hat{D}_{\mathbf{H}}$ involves calculating $l$ summands (where
$l := \min\{n, m\}$), that is

$$\sup_{h \in \mathcal{H}_k} \left| \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - \frac{1}{m-k+1} \sum_{i=1}^{m-k+1} h(Y_{i..i+k-1}) \right| \quad (7)$$

for each $k = 1..l$. Assuming that $h \in \mathcal{H}_k$ are indicator functions, calculating each of
the summands amounts to solving the following $k$-dimensional binary classification problem.
Consider $X_{i..i+k-1}$, $i = 1..n-k+1$ as class-1 examples and $Y_{i..i+k-1}$, $i = 1..m-k+1$ as
class-0 examples. The supremum (7) is attained on $h \in \mathcal{H}_k$ that minimizes the empirical
risk, with examples weighted with respect to the sample size. Indeed, we can define the weighted
empirical risk of any $h \in \mathcal{H}_k$ as

$$\frac{1}{n-k+1} \sum_{i=1}^{n-k+1} (1 - h(X_{i..i+k-1})) + \frac{1}{m-k+1} \sum_{i=1}^{m-k+1} h(Y_{i..i+k-1}),$$

which is obviously minimized by any $h \in \mathcal{H}_k$ that attains (7).

Thus, as long as we have a way to find $h \in \mathcal{H}_k$ that minimizes empirical risk, we have
a consistent estimate of $D_{\mathbf{H}}(\rho_X, \rho_Y)$, under the mild conditions on
$\mathbf{H}$ required by Theorem 1. Since the dimension of the resulting classification problems
grows with the length of the sequences, one should prefer methods that work in high dimensions, such
as soft-margin SVMs [6].
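
The link between the weighted empirical risk and a summand of the empirical distance can be sketched as follows. The inputs are assumed to be the 0/1 outputs of some already-trained classifier $h$ on the class-1 windows (from $X$) and the class-0 windows (from $Y$); how $h$ is obtained (an SVM in the paper's experiments) is outside this sketch, and the helper names are illustrative assumptions.

```python
def weighted_risk(pred_x, pred_y):
    """Weighted empirical risk of h.

    pred_x: 0/1 outputs of h on windows of X (class 1).
    pred_y: 0/1 outputs of h on windows of Y (class 0).
    Each class is weighted by the inverse of its own sample size.
    """
    missed_x = sum(1 - p for p in pred_x) / len(pred_x)
    false_y = sum(pred_y) / len(pred_y)
    return missed_x + false_y


def summand_from_classifier(pred_x, pred_y):
    """|mean h(X-windows) - mean h(Y-windows)| for the given h."""
    mean_x = sum(pred_x) / len(pred_x)
    mean_y = sum(pred_y) / len(pred_y)
    return abs(mean_x - mean_y)
```

Since the risk equals $1 - (\text{mean}_X - \text{mean}_Y)$, a classifier with small weighted risk gives a large value of the summand: whenever $\text{mean}_X \ge \text{mean}_Y$, the two helpers satisfy `summand_from_classifier(...) == 1 - weighted_risk(...)` up to rounding.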

A particularly remarkable feature is that the choice of $\mathcal{H}_k$ is much easier for the
problems that we consider than in the binary classification problem. Specifically, if (for some
fixed $k$) the classifier that achieves the minimal (Bayes) error for the classification problem is
not in $\mathcal{H}_k$, then obviously the error of an empirical risk minimizer will not tend to
zero, no matter how much data we have. In contrast, all we need to achieve asymptotically 0 error in
estimating $D$ (and therefore, in the learning problems considered below) is that the sets
$\mathcal{H}_k$ asymptotically generate $\mathcal{F}_{\mathcal{X}^k}$ and have a finite VC dimension
(for each $k$). This is the case already for the set of hyperplanes in $\mathbb{R}^k$! Thus, while
the choice of $\mathcal{H}_k$ (or, say, of the kernel to use in SVM) is still important from the
practical point of view, it is almost irrelevant for the theoretical consistency results. Thus, we
have the following.

Claim 1. The approximation error $|D_{\mathbf{H}}(P, Q) - \hat{D}_{\mathbf{H}}(X, Y)|$, and thus the
error of the algorithms below, can be much smaller than the error of classification algorithms used
to calculate $\hat{D}_{\mathbf{H}}(X, Y)$.

Finally, we remark that while in (2) the number of summands is $l$, it can be replaced with any
$\gamma_l$ such that $\gamma_l \to \infty$, without affecting any asymptotic consistency results. A
practically viable choice is $\gamma_l = \log l$; in fact, there is no reason to choose a faster
growing $\gamma_l$ since the estimates for higher-order summands will not have enough data to
converge. This is also the value we use in the experiments.

5 The three-sample problem 

We start with a conceptually simple problem known in statistics as the three-sample problem
(sometimes also called time-series classification). We are given three samples
$X = (X_1, \dots, X_n)$, $Y = (Y_1, \dots, Y_m)$ and $Z = (Z_1, \dots, Z_l)$. It is known that $X$
and $Y$ were generated by different time-series distributions, whereas $Z$ was generated by the same
distribution as either $X$ or $Y$. It is required to find out which one is the case. Both
distributions are assumed to be stationary ergodic, but no further assumptions are made about them
(no independence, mixing or memory assumptions). The three-sample problem for dependent time series
has been addressed in [9] for Markov processes and in [23] for stationary ergodic time series. The
latter work uses an approach based on the distributional distance.

Indeed, to solve this problem it suffices to have consistent estimates of some distance between time 
series distributions. Thus, we can use the telescope distance. The following statement is a simple 
corollary of Theorem 1.

Theorem 2. Let the samples $X = (X_1, \dots, X_n)$, $Y = (Y_1, \dots, Y_m)$ and
$Z = (Z_1, \dots, Z_l)$ be generated by stationary ergodic distributions $\rho_X$, $\rho_Y$ and
$\rho_Z$, with $\rho_X \ne \rho_Y$ and either (i) $\rho_Z = \rho_X$ or (ii) $\rho_Z = \rho_Y$.
Assume that the sets $\mathcal{H}_k \subset \mathcal{X}^k$, $k \in \mathbb{N}$ are separable sets of
indicator functions of finite VC dimension such that $\mathcal{H}_k$ generates
$\mathcal{F}_{\mathcal{X}^k}$. A test that declares (i) if
$\hat{D}_{\mathbf{H}}(Z, X) \le \hat{D}_{\mathbf{H}}(Z, Y)$ and (ii) otherwise makes only finitely
many errors with probability 1 as $n, m, l \to \infty$.
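
The decision rule of the test is a single comparison once an estimate of the distance is available. The sketch below takes the distance as a callable argument; `toy_dist` (absolute difference of sample means) is only a hypothetical placeholder so that the example runs on its own, and is not a consistent estimate of the telescope distance.

```python
def three_sample_test(dist, x, y, z):
    """Return 'X' if z is attributed to the source of x, else 'Y'.

    dist is any function estimating a distance between two samples,
    for instance an implementation of the empirical telescope distance.
    """
    return "X" if dist(z, x) <= dist(z, y) else "Y"


def toy_dist(a, b):
    """Placeholder distance: absolute difference of sample means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

For instance, `three_sample_test(toy_dist, [0, 0, 0], [5, 5, 5], [0.1, 0.0, 0.2])` attributes the third sample to `X`.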

It is straightforward to extend this theorem to more than two classes; in other words, instead of X 
and Y one can have an arbitrary number of samples from different stationary ergodic distributions. 

6 Clustering time series 

We are given $N$ samples $X^1 = (X^1_1, \dots, X^1_{n_1}), \dots, X^N = (X^N_1, \dots, X^N_{n_N})$
generated by $k$ different stationary ergodic time-series distributions $\rho_1, \dots, \rho_k$. The
number $k$ is known, but the distributions are not. It is required to group the $N$ samples into $k$
groups (clusters), that is, to output a partitioning of $\{X^1..X^N\}$ into $k$ sets. While there
may be many different approaches to define what is a good clustering (and, in general, deciding what
is a good clustering is a difficult problem), for the problem of classifying time-series samples
there is a natural choice, proposed in [21]: those samples should be put together that were
generated by the same distribution. Thus, define the target clustering as the partitioning in which
those and only those samples that were generated by the same distribution are placed in the same
cluster. A clustering algorithm is called asymptotically consistent if with probability 1 there is
an $n'$ such that the algorithm produces the target clustering whenever
$\max_{i=1..N} n_i \ge n'$.

Again, to solve this problem it is enough to have a metric between time-series distributions that can 
be consistently estimated. Our approach here is based on the telescope distance, and thus we use
$\hat{D}_{\mathbf{H}}$.

The clustering problem is relatively simple if the target clustering has what is called the strict
separation property [4]: every two points in the same target cluster are closer to each other than
to any point from a different target cluster. The following statement is an easy corollary of
Theorem 1.

Theorem 3. Assume that the sets $\mathcal{H}_k \subset \mathcal{X}^k$, $k \in \mathbb{N}$ are
separable sets of indicator functions of finite VC dimension, such that $\mathcal{H}_k$ generates
$\mathcal{F}_{\mathcal{X}^k}$. If the distributions $\rho_1, \dots, \rho_k$ generating the samples
$X^1 = (X^1_1, \dots, X^1_{n_1}), \dots, X^N = (X^N_1, \dots, X^N_{n_N})$ are stationary ergodic,
then with probability 1 from some $n := \max_{i=1..N} n_i$ on the target clustering has the strict
separation property with respect to $\hat{D}_{\mathbf{H}}$.

With the strict separation property at hand, it is easy to find asymptotically consistent algorithms. 
We will give some simple examples, but the theorem below can be extended to many other distance- 
based clustering algorithms. 

The average linkage algorithm works as follows. The distance between clusters is defined as the
average distance between points in these clusters. First, put each point into a separate cluster.
Then, merge the two closest clusters; repeat the last step until the total number of clusters is
$k$. The farthest point clustering works as follows. Assign $c_1 := X^1$ to the first cluster. For
$i = 2..k$, find the point $X^j$, $j \in \{1..N\}$ that maximizes the distance
$\min_{t=1..i-1} \hat{D}_{\mathbf{H}}(X^j, c_t)$ (to the points already assigned to clusters) and
assign $c_i := X^j$ to the cluster $i$. Then assign each of the remaining points to the nearest
cluster. The following statement is a corollary of Theorem 3.

Theorem 4. Under the conditions of Theorem 3, average linkage and farthest point clusterings are
asymptotically consistent.
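
Both algorithms operate only on the matrix of pairwise distances between samples, so they can be sketched independently of how the distances are computed. In the sketch below `dist` is assumed to be a symmetric N x N list of lists of estimated distances; the function names and the output formats are choices made for this example.

```python
def average_linkage(dist, k):
    """Merge the two closest clusters until k clusters remain.

    Returns a list of clusters, each a list of sample indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Average distance between points of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: avg(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters


def farthest_point(dist, k):
    """Pick k centres greedily, then assign every sample to its nearest centre.

    Returns a list of cluster labels in 0..k-1, one per sample.
    """
    centres = [0]  # the first sample is the first centre
    while len(centres) < k:
        centres.append(max(range(len(dist)),
                           key=lambda j: min(dist[j][c] for c in centres)))
    return [min(range(k), key=lambda t: dist[j][centres[t]])
            for j in range(len(dist))]
```

On a distance matrix with the strict separation property, both functions recover the same partition.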

Note that we do not require the samples to be independent; the joint distributions of the samples may 
be completely arbitrary, as long as the marginal distribution of each sample is stationary ergodic. 
These results can be extended to the online setting in the spirit of [13].

7 Speed of convergence 

The results established so far are asymptotic out of necessity: they are established under the as- 
sumption that the distributions involved are stationary ergodic, which is too general to allow for 
any meaningful finite-time performance guarantees. Moreover, some statistical problems, such as 
homogeneity testing or clustering when the number of clusters is unknown, are provably impossible 
to solve under this assumption [22].

While it is interesting to be able to establish consistency results under such general assumptions, it 
is also interesting to see what results can be obtained under stronger assumptions. Moreover, since 
it is usually not known in advance whether the data at hand satisfies given assumptions or not, it 
appears important to have methods that have both asymptotic consistency in the general setting and 
finite-time performance guarantees under stronger assumptions. 

In this section we will look at the speed of convergence of $\hat{D}_{\mathbf{H}}$ under certain
mixing conditions, and use it to construct solutions for the problems of homogeneity testing and
clustering with an unknown number of clusters, as well as to establish finite-time performance
guarantees for the methods presented in the previous sections.

A stationary distribution on the space of one-way infinite sequences
$(\mathcal{X}^{\mathbb{N}}, \mathcal{F}_{\mathbb{N}})$ can be uniquely extended to a stationary
distribution on the space of two-way infinite sequences
$(\mathcal{X}^{\mathbb{Z}}, \mathcal{F}_{\mathbb{Z}})$ of the form $\dots, X_{-1}, X_0, X_1, \dots$.

Definition 3 ($\beta$-mixing coefficients). For a process distribution $\rho$ define the mixing
coefficients

$$\beta(\rho, k) := \sup_{A \in \sigma(X_{-\infty..0}),\ B \in \sigma(X_{k..\infty})} |\rho(A \cap B) - \rho(A)\rho(B)|,$$

where $\sigma(..)$ denotes the sigma-algebra of the random variables in brackets.

When $\beta(\rho, k) \to 0$ the process $\rho$ is called absolutely regular; this condition is much
stronger than ergodicity, but is much weaker than the i.i.d. assumption.

7.1 Speed of convergence of $\hat{D}_{\mathbf{H}}$

Assume that a sample $X_{1..n}$ is generated by a distribution $\rho$ that is uniformly
$\beta$-mixing with coefficients $\beta(\rho, k)$. Assume further that $\mathcal{H}_k$ is a set of
indicator functions with a finite VC dimension $d_k$, for each $k \in \mathbb{N}$.

The general tool that we use to obtain performance guarantees in this section is the following bound
that can be obtained from the results of [12].

$$q_n(\rho, \mathcal{H}_k, \varepsilon) := \rho\left( \sup_{h \in \mathcal{H}_k} \left| \frac{1}{n-k+1} \sum_{i=1}^{n-k+1} h(X_{i..i+k-1}) - E_{\rho} h(X_{1..k}) \right| > \varepsilon \right) \le n \beta(\rho, t_n - k) + 8 t_n^{d_k+1} e^{-l_n \varepsilon^2 / 8}, \quad (8)$$

where $t_n$ and $l_n$ are any integers in $1..n$. The parameters $t_n$, $l_n$ should be set according
to the values of $\beta$ in order to optimize the bound.



One can use similar bounds for classes of finite Pollard dimension [18] or more general bounds
expressed in terms of covering numbers, such as those given in [12]. Here we consider classes
of finite VC dimension only for the ease of the exposition and for the sake of continuity with the
previous section (where it was necessary).

Furthermore, for the rest of this section we assume geometric $\beta$-mixing distributions, that is,
$\beta(\rho, t) \le \gamma^t$ for some $\gamma < 1$. Letting $l_n = t_n = \sqrt{n}$ the bound (8)
becomes

$$q_n(\rho, \mathcal{H}_k, \varepsilon) \le n \gamma^{\sqrt{n} - k} + 8 n^{(d_k+1)/2} e^{-\sqrt{n} \varepsilon^2 / 8}. \quad (9)$$

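Lemma 2. Let two samples $X_{1..n}$ and $Y_{1..m}$ be generated by stationary distributions $\rho_X$
and $\rho_Y$ whose $\beta$-mixing coefficients satisfy $\beta(\rho_{\cdot}, t) \le \gamma^t$ for
some $\gamma < 1$. Let $\mathcal{H}_k$, $k \in \mathbb{N}$ be some sets of indicator functions on
$\mathcal{X}^k$ whose VC dimension $d_k$ is finite and non-decreasing with $k$. Then

$$P(|\hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) - D_{\mathbf{H}}(\rho_X, \rho_Y)| > \varepsilon) \le 2\Delta(\varepsilon/4, n') \quad (10)$$

where $n' := \min\{n, m\}$, the probability is with respect to $\rho_X \times \rho_Y$ and

$$\Delta(\varepsilon, n) := -\log \varepsilon \left( n \gamma^{\sqrt{n} + \log \varepsilon} + 8 n^{(d_{-\log \varepsilon} + 1)/2} e^{-\sqrt{n} \varepsilon^2 / 8} \right). \quad (11)$$

Proof. Note that $\sum_{k=-\log(\varepsilon/2)}^{\infty} w_k \le \varepsilon/2$. Using this and the
definitions (1) and (2) of $D_{\mathbf{H}}$ and $\hat{D}_{\mathbf{H}}$ we obtain

$$P(|\hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) - D_{\mathbf{H}}(\rho_X, \rho_Y)| > \varepsilon) \le \sum_{k=1}^{-\log(\varepsilon/2)} \left( q_n(\rho_X, \mathcal{H}_k, \varepsilon/4) + q_n(\rho_Y, \mathcal{H}_k, \varepsilon/4) \right),$$

which, together with (9), implies the statement. □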
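
To get a feeling for how the quantity $\Delta(\varepsilon, n)$ of (11) behaves, it can be evaluated numerically. The sketch below is a direct transcription of (11) under two assumptions that are not in the text: the non-integer index $-\log \varepsilon$ of the VC dimension is rounded up, and the VC dimensions are supplied by a caller-provided function `vc_dim`. The name `delta_bound` is likewise an assumption of this example.

```python
import math


def delta_bound(eps, n, gamma, vc_dim):
    """Evaluate Delta(eps, n) from (11) for 0 < eps < 1.

    gamma:  geometric mixing rate, beta(rho, t) <= gamma**t.
    vc_dim: function k -> d_k (VC dimension of H_k).
    The index -log(eps) of d is rounded up to an integer (an assumption).
    """
    k_max = -math.log(eps)
    root = math.sqrt(n)
    d = vc_dim(math.ceil(k_max))
    mixing_term = n * gamma ** (root + math.log(eps))
    vc_term = 8 * n ** ((d + 1) / 2) * math.exp(-root * eps ** 2 / 8)
    return k_max * (mixing_term + vc_term)
```

For example, with `vc_dim = lambda k: k + 1`, `gamma = 0.5` and `eps = 0.5` the value decreases as `n` grows from `10**6` to `10**8`; for small `n` the value exceeds 1, in which case the bound is vacuous.
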
7.2 Homogeneity testing 

Given two samples $X_{1..n}$ and $Y_{1..m}$ generated by distributions $\rho_X$ and $\rho_Y$
respectively, the problem of homogeneity testing (or the two-sample problem) consists in deciding
whether $\rho_X = \rho_Y$. A test is called (asymptotically) consistent if its probability of error
goes to zero as $n' := \min\{m, n\}$ goes to infinity. In general, for stationary ergodic time
series distributions, there is no asymptotically consistent test for homogeneity [22], so stronger
assumptions are in order.

Homogeneity testing is one of the classical problems of mathematical statistics, and one of the most 
studied ones. Vast literature exists on homogeneity testing for i.i.d. data, and for dependent processes
as well. We do not attempt to survey this literature here. Our contribution to this line of research is 
to show that this problem can be reduced (via the telescope distance) to binary classification, in the 
case of strongly dependent processes satisfying some mixing conditions. 

It is easy to see that under the mixing conditions of Lemma 2 a consistent test for homogeneity
exists, and finite-sample performance guarantees can be obtained. It is enough to find a sequence
$\varepsilon_n \to 0$ such that $\Delta(\varepsilon_n, n) \to 0$ (see (11)). Then the test can be
constructed as follows: say that the two sequences $X_{1..n}$ and $Y_{1..m}$ were generated by the
same distribution if $\hat{D}_{\mathbf{H}}(X_{1..n}, Y_{1..m}) < \varepsilon_{\min\{n,m\}}$;
otherwise say that they were generated by different distributions. The following statement is an
immediate consequence of Lemma 2.

Theorem 5. Under the conditions of Lemma 2 the probability of Type I error (the distributions are
the same but the test says they are different) of the described test is upper-bounded by
$4\Delta(\varepsilon/8, n')$. The probability of Type II error (the distributions are different but
the test says they are the same) is upper-bounded by $4\Delta(\delta - \varepsilon/8, n')$ where
$\delta := \frac{1}{2} D_{\mathbf{H}}(\rho_X, \rho_Y)$.

The optimal choice of $\varepsilon_n$ may depend on the speed at which $d_k$ (the VC dimension of
$\mathcal{H}_k$) increases; however, for most natural cases (recall that $\mathcal{H}_k$ are also
parameters of the algorithm) this growth is polynomial so the main term to control is
$e^{-\sqrt{n} \varepsilon^2 / 8}$.

For example, if $\mathcal{H}_k$ is the set of halfspaces in $\mathcal{X}^k = \mathbb{R}^k$ then
$d_k = k + 1$ and one can choose $\varepsilon_n := n^{-1/8}$. The resulting probability of Type I
error decreases as $\exp(-n^{1/4})$.
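
Given a value of the empirical telescope distance, the test itself is a threshold comparison. The sketch below uses the example threshold $\varepsilon_n = n^{-1/8}$ from the halfspace case; the function name and the convention of passing the already computed distance `d_hat` are assumptions of this example.

```python
def homogeneity_test(d_hat, n, m):
    """Return True if the two samples are declared to share a distribution.

    d_hat: empirical telescope distance between the two samples.
    n, m:  lengths of the two samples.
    The threshold eps = min(n, m) ** (-1/8) is the example choice
    for halfspaces; other classes H_k may call for other thresholds.
    """
    eps = min(n, m) ** (-1 / 8)
    return d_hat < eps
```

With samples of length 10000 the threshold is about 0.316, so an estimated distance of 0.1 leads to "same distribution" and 0.5 to "different distributions".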

7.3 Clustering with a known or unknown number of clusters 

If the distributions generating the samples satisfy certain mixing conditions, then we can augment
Theorems 3 and 4 with finite-sample performance guarantees.

Theorem 6. Let the distributions $\rho_1, \dots, \rho_k$ generating the samples
$X^1 = (X^1_1, \dots, X^1_{n_1}), \dots, X^N = (X^N_1, \dots, X^N_{n_N})$ satisfy the conditions of
Lemma 2. Define $\delta := \min_{i,j=1..k,\ i \ne j} D_{\mathbf{H}}(\rho_i, \rho_j)$ and
$n := \min_{i=1..N} n_i$. Then with probability at least

$$1 - N(N-1)\Delta(\delta/4, n)/2$$

the target clustering of the samples has the strict separation property. In this case single linkage
and farthest point algorithms output the target clustering.

Proof. Note that a sufficient condition for the strict separation property to hold is that for every
one out of $N(N-1)/2$ pairs of samples the estimate $\hat{D}_{\mathbf{H}}(X^i, X^j)$, $i, j = 1..N$
is within $\delta/4$ of the $D_{\mathbf{H}}$ distance between the corresponding distributions. It
remains to apply Lemma 2 to obtain the first statement, and the second statement is obvious (cf.
Theorem 4). □

As with homogeneity testing, while in the general case of stationary ergodic distributions it is
impossible to have a consistent clustering algorithm when the number of clusters $k$ is unknown, the
situation changes if the distributions satisfy certain mixing conditions. In this case a consistent
clustering algorithm can be obtained as follows. Assign to the same cluster all samples that are at
most $\varepsilon_n$-far from each other, where the threshold $\varepsilon_n$ is selected the same
way as for homogeneity testing: $\varepsilon_n \to 0$ and $\Delta(\varepsilon_n, n) \to 0$. The
optimal choice of this parameter depends on the choice of $\mathcal{H}_k$ through the speed of
growth of the VC dimension $d_k$ of these sets.

Theorem 7. Given $N$ samples generated by $k$ different stationary distributions $\rho_i$,
$i = 1..k$ (unknown $k$) all satisfying the conditions of Lemma 2, the probability of error
(misclustering at least one sample) of the described algorithm is upper-bounded by

$$2N(N-1) \max\{\Delta(\varepsilon/8, n), \Delta(\delta - \varepsilon/8, n)\}$$

where $\delta := \min_{i,j=1..k,\ i \ne j} D_{\mathbf{H}}(\rho_i, \rho_j)$ and
$n = \min_{i=1..N} n_i$, with $n_i$, $i = 1..N$ being lengths of the samples.
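
The threshold-based algorithm for an unknown number of clusters can be sketched on a matrix of pairwise distance estimates. One detail is an assumption of this sketch: when the estimates are not perfectly consistent with each other, samples are grouped by connected components of the "at most eps apart" graph. When all estimates are accurate, as in the event bounded by Theorem 7, this coincides with grouping samples that are pairwise within the threshold.

```python
def threshold_clustering(dist, eps):
    """Group samples whose estimated distance is at most eps.

    dist: symmetric N x N list of lists of estimated distances.
    Returns one integer label per sample; the number of distinct
    labels is the estimated number of clusters.
    """
    n = len(dist)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over the threshold graph
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and dist[i][j] <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```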

8 Experiments 

For experimental evaluation we chose the problem of time-series clustering. Average-linkage
clustering is used, with the telescope distance between samples calculated using an SVM, as
described in Section 4. In all experiments, SVM is used with radial basis kernel, with default
parameters of libsvm [5].

8.1 Synthetic data 

For the artificial setting we have chosen highly-dependent time series distributions which have the
same single-dimensional marginals and which cannot be well approximated by finite- or
countable-state models. The distributions $\rho(\alpha)$, $\alpha \in (0, 1)$, are constructed as
follows. Select $r_0 \in [0, 1]$ uniformly at random; then, for each $i = 1..n$ obtain $r_i$ by
shifting $r_{i-1}$ by $\alpha$ to the right, and removing the integer part. The time series
$(X_1, X_2, \dots)$ is then obtained from $r_i$ by drawing a point from a distribution law
$\mathcal{N}_1$ if $r_i < 0.5$ and from $\mathcal{N}_2$ otherwise. $\mathcal{N}_1$ is a
3-dimensional Gaussian with mean of 0 and covariance matrix $\mathrm{Id} \times 1/4$.
$\mathcal{N}_2$ is the same but with mean 1. If $\alpha$ is irrational (in experiments simulated by
a longdouble with a long mantissa) then the distribution $\rho(\alpha)$ is stationary ergodic, but
does not belong to any simpler natural distribution family [25]. The single-dimensional marginal is
the same for all values of $\alpha$. The latter two properties make all parametric and most
non-parametric methods inapplicable to this problem.

In our experiments, we use two process distributions $\rho(\alpha_i)$, $i \in \{1, 2\}$, with
$\alpha_1 = 0.31...$, $\alpha_2 = 0.35...$. The dependence of error rate on the length of time
series is shown in Figure 1. One clustering experiment on sequences of length 1000 takes about
5 min. on a standard laptop.
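
The construction of the synthetic series can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: it uses ordinary double-precision floats (so the shift is only an approximation of an irrational number), and the function name, the `seed` parameter and the example value of `alpha` are hypothetical; the values of $\alpha_1$ and $\alpha_2$ used in the paper are not reproduced here.

```python
import random


def rotation_series(alpha, n, dim=3, seed=None):
    """Generate n points of the rotation-driven Gaussian process.

    r_0 is uniform on [0, 1); r_i = (r_{i-1} + alpha) mod 1.
    X_i is drawn from a dim-dimensional Gaussian with covariance
    Id * 1/4 (standard deviation 0.5 per coordinate) and mean 0
    if r_i < 0.5, mean 1 otherwise.
    """
    rng = random.Random(seed)
    r = rng.random()
    series = []
    for _ in range(n):
        r = (r + alpha) % 1.0  # shift to the right, drop the integer part
        mean = 0.0 if r < 0.5 else 1.0
        series.append([rng.gauss(mean, 0.5) for _ in range(dim)])
    return series
```

For instance, `rotation_series(2 ** 0.5 - 1, 1000, seed=0)` produces one sample of length 1000 with an arbitrary example shift.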

8.2 Real data 

To demonstrate the applicability of the proposed methods to realistic scenarios, we chose the
brain-computer interface data from BCI competition III [17]. The dataset consists of (pre-processed)
BCI recordings of mental imagery: a person is thinking about one of three subjects (left foot, right
foot, a random letter). Originally, each time series consisted of several consecutive sequences of
different classes, and the problem was supervised: three time series for training and one for
testing. We split each of the original time series into classes, and then used our clustering
algorithm in a completely unsupervised setting. The original problem is 96-dimensional, but we used
only the first 3 dimensions (using all 96 gives worse performance). The typical sequence length is
300. The performance is reported in Table 1, labeled TS_SVM. All the computation for this experiment
takes approximately 6 minutes on a standard laptop.

The following methods were used for comparison. First, we used dynamic time warping (DTW) [24],
which is a popular baseline approach for time-series clustering. The other two methods in Table 1
are from [10]. The comparison is not fully relevant, since the results in [10] are for different
settings; the method KCpA was used for change-point estimation (a different but also unsupervised
setting), and SVM was used in a supervised setting. The latter is of particular interest since the
classification method we used in the telescope distance is also SVM, but our setting is
unsupervised (clustering).




Figure 1: Error of two-class clustering using TS_SVM; 10 time series in each target cluster,
averaged over 20 runs. (Horizontal axis: time of observation, 200 to 1200.)





          S1     S2     S3
TS_SVM    84%    81%    61%
DTW       46%    41%    36%
KCpA      79%    74%    61%
SVM       76%    69%    60%

Table 1: Clustering accuracy in the BCI dataset. 3 subjects (columns), 4 methods (rows). Our method
is TS_SVM.